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Abstract 

Expectile, first introduced by Newey and Powell (1987) in the econometrics lit¬ 
erature, has recently become increasingly popular in risk management and capital 
allocation for financial institutions due to its desirable properties such as coherence 
and elicitability. The current standard tool for expectile regression analysis is the mul¬ 
tiple linear expectile regression proposed by Newey and Powell in 1987. The growing 
applications of expectile regression motivate us to develop a much more flexible non- 
parametric multiple expectile regression in a reproducing kernel Hilbert space. The 
resulting estimator is called KERE which has multiple advantages over the classical 
multiple linear expectile regression by incorporating non-linearity, non-additivity and 
complex interactions in the final estimator. The kernel learning theory of KERE is 
established. We develop an efficient algorithm inspired by majorization-minimization 
principle for solving the entire solution path of KERE. It is shown that the algorithm 
converges at least at a linear rate. Extensive simulations are conducted to show the 
very competitive finite sample performance of KERE. We further demonstrate the 
application of KERE by using personal computer price data. 
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1 Introduction 


The expectile introduced by Newey and Powell (1987) is becoming an increasingly popular 


tool in risk management and capital allocation for financial institutions. Let Y be a random 
variable, the cu-expectile of Y, denoted as f u , is defined by 


co = 


E{\Y - I u \Iy<!u) 


co G (0,1). 


( 1 ) 


E{\Y-f u \} ’ 

In financial applications, the expectile has been widely used as a tool for efficient estimation 


of the expected shortfall (ES) through a one-one mapping between the two (Taylor, 2008 


Hamidi et al. 2014 Xie et ah] 2014). More recently, many researchers started to advocate 
the use of the expectile as a favorable alternative to other two commonly used risk measures - 
Value at Risk (VaR) and ES, due to its desirable properties such as coherence and elicitability 
(Kuan et al. 2009 |Gneiting , 2011 Ziegel 2014). VaR has been criticized mainly for two 
drawbacks: First, it does not reflect the magnitude of the extreme losses for the underlying 
risk as it is only determined by the probability of such losses; Second, VaR is not a coherent 


risk measure due to the lack of the sub-additive property (Emrner et ah, 2013 Embrechts 


et al. 2014). Hence the risk of merging portfolios together could get worse than adding 


the risks separately, which contradicts the notion that risk can be reduced by diversification 


(Artzner et ah, 1999). Unlike VaR, ES is coherent and it considers the magnitude of the 


losses when the VaR is exceeded. However, a major problem with ES is that it cannot be 
reliably backtested in the sense that competing forecasts of ES cannot be properly evaluated 


through comparison with realized observations. Gneiting (2011) attributed this weakness to 


the fact that ES does not have elicitability. Ziegel (2014) further showed that the expectile 


are the only risk measure that is both coherent and elicitable. 

In applications we often need to estimate the conditional expectile of the response variable 
given a set of covariates. This is called expectile regression. Statisticians and Econometri¬ 
cians pioneered the study of expectile regression. Theoretical properties of the multiple linear 


expectile were firstly studied in Newey and Powell (1987) and Efron (1991). Yao and Tong 


(1996) studied a non-parametric estimator of conditional expectiles based on local linear 


polynomials with a one-dimensional covariate, and established the asymptotic property of 
the estimator. A semiparametric expectile regression model relying on penalized splines is 
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proposed by Sobotka and Kneib (2012). Yang and Zou (2015) adopted the gradient tree 


boosting algorithm for expectile regression. 

In this paper, we propose a flexible nonparametric expectile regression estimator con¬ 


structed in a reproducing kernel Hilbert space (RKHS) (Wahba 1990). Our contributions 
in this article are twofold: First, we extend the parametric expectile model to a fully non¬ 
parametric multiple regression setting and develop the corresponding kernel learning theory. 
Second, we propose an efficient algorithm that adopts the Majorization-Minimization prin¬ 
ciple for computing the entire solution path of the kernel expectile regression. We provide 
numerical convergence analysis for the algorithm. Moreover, we provide an accompanying R 
package that allows other researchers and practitioners to use the kernel expectile regression. 

The rest of the paper is organized as follows. In Section 2 we present the kernel expectile 
regression and develop an asymptotic learning theory. Section 3 derives the fast algorithm 
for solving the solution paths of the kernel expectile regression. The numerical convergence 
of the algorithm is examined. In Section 4 we use simulation models to show the high 
prediction accuracy of the kernel expectile regression. We analyze the personal computer 
price data in Section 5. The technical proofs are relegated to an appendix. 


2 Kernel Expectile Regression 


2.1 Methodology 


Newey and Powell (1987) showed that the cu-expectile / w of Y has an equivalent definition 

( 2 ) 


given by 


fu, = argmin E{<j) u (Y - /)}, 
/ 


where 




(1 — uj)t 2 t < 0, 


(3) 


ut 2 


t > 0. 


Consequently, Newey and Powell (1987) showed that the cu-expectile f w of Y given the set 
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of covariates X = x, denoted by / w (x), can be defined as 


/w(x) = argmin E{<f> u (Y - f) \ X = x}. 
/ 


( 4 ) 


Newey and Powell (|1987) developed the multiple linear expectile regression based on Q. 


Given n random observations (x 1; yi),--- , (x n ,?/ n ) with x, G M p and G M, Newey and 


Powell (1987) proposed the following formulation: 


(/3, fio) = argmin- Y] (pjy,, - xjf3 - /3 0 ). 
Wo) 


(5) 


2=1 


Efron 


(1991) proposed an efficient 


Then the estimated conditional (n-expectile is x^/3 + /30- 
algorithm for computing ([5]). 

The linear expectile estimator can be too restrictive in many real applications. Re¬ 
searchers have also considered more flexible expectile regression estimators. For example, [Yao 


and Tong (1996) studied a local linear-polynomial expectile estimator with a one-dimensional 


covariate. However, the local fitting approach is not suitable when the dimension of explana¬ 


tory variables is more than five. This limitation of local smoothing motivated Yang and Zou 


(2015) to develop a nonparametric expectile regression estimator based on the gradient tree 


boosting algorithm. The tree-boosted expectile regression tries to minimize the empirical 
expectile loss: 

1 JX 

( 6 ) 


min— yZ<fc(!/i-/( x i)), 
fGT n ^ 

1 = 1 

where each candidate function / G T is assumed to be an ensemble of regression trees. 

In this article, we consider another nonparametric approach to the multiple expectile 
regression. To motivate our method, let us first look at the special expectile regression with 
uj = 0.5. It is easy to see from ([3]) and Q that if u — 0.5, expectile regression actually 
reduces to ordinary conditional mean regression. A host of flexible regression methods have 
been well-studied for the conditional mean regression, such as generalized additive model, 
regression trees, boosted regression trees, and function estimation in a reproducing kernel 


Hilbert space (RKHS). Hastie et al. (2009) provided excellent introductions to all these 
methods. In particular, mean regression in a RKHS has a long history and a rich success 
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record (Wahba 1990). So in the present work we propose the kernel expectile regression in 


a RKHS. 

Denote by the Hilbert space generated by a positive definite kernel K. By the 

Mercer’s theorem, kernel AT has an eigen-expansion A'(x, x 7 ) = (x)<^j(x 7 ) with 

Vi > 0 and vf < oo. The function / in HR- can be expressed as an expansion of 
these eigen-functions /(x) = with the kernel induced squared norm 11 fil2 ~ 

c ?/d < 00 • Some most widely used kernel functions are 

• Gaussian RBF kernel A'(x. x 7 ) = exp 


Sigmoidal kernel AT(x, x 7 ) = tanhR (x, x 7 ) + 9), 
Polynomial kernel A'(x, x 7 ) = ((x, x 7 ) + 6) d . 


Other kernels can be found in Smola et al. (1998) and Hastie et al. (2009). 


Given n observations {(xj, t/i)}” =1 , the kernel expectile regression estimator (KERE) is 
defined as 

n 

” ' (7) 


(/n(x), d 0 ) = arg min Y] - a 0 - /(x;)) + 

f€M K ,a 0 £M. z ' 
i =1 


2 


where x, : e M p , cu 0 e M. The estimated conditional w-expectile is do + /„(x). Sometimes, 
one can absorb the intercept term into the nonparametric function /. We keep the intercept 
term in order to make a direct comparison to the multiple linear expectile regression. 

Although (J7| is often an optimization problem in an infinite-dimensional space, depend¬ 


ing on the choice of the kernel, the representer theorem (Wahba, 1990) ensures that the 


solution to (J7|) always lies in a finite-dimensional subspace spanned by kernel functions on 
observational data, i.e., 

n 

/( x ) = 5^Q!i-K'( x i, x ), (8) 


i =1 


for some {aj}" =1 C 


By ® and the reproducing property of RKHS (Wahba 1990) we have 


X) X) a ‘ a i K ( x - x .i1- 


(9) 


i=l j=1 


Based on ([8]) and 0 


we can 


rewrite the minimization problem ([7]) in a finite-dimensional 
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space 


{^*}i o arg min E 0 a; I Vi ^0 ^ ^j ^ ^ j ) I “1“ ^ EE cqaqA'^Xj). (10) 

W i=0 . =1 \ i=1 / i= i j=i 


The corresponding KERE estimator is do + x ). 

The computation of KERE is based on © and we use both 0 and fllO] ) for the theo¬ 
retical analysis of KERE. 


2.2 Kernel learning theory 

In this section we develop a kernel learning theory for KERE. We first discuss the criterion 
for evaluating an estimator in the context of expectile regression. Given the loss function (j> w , 
the risk is lZ(f , « 0 ) = E^^cp^y — a 0 — /(x)). It is argued that 7 Z(f, a 0 ) is a more appropriate 
evaluation measure in practice than the squared error risk defined as E , x ||/(x) + o; 0 — /*(x)|| 2 , 
where f* is the true conditional expectile of Y given X = x. The reason is simple: Let /, do 
be any estimator based on the training data. By law of large number we see that 

n(f, d 0 ) = E {yj>Xj}T=i — Mvj - «o - Ax,-)), 

j=i 

and 

^ m 

K{f, «o) = lim — Y] <f> u (yj - d 0 - /(xj)), 

m-¥ oo Tfl z ' 

3 = 1 

where {(xj, yj)}jE, is another independent test sample. Thus, one can use techniques such 
as cross-validation to estimate Additionally, the squared error risk depends on 

the function /*(x), which is usually unknown. Thus, we prefer to use !Z(f,a 0 ) over the 
squared error risk. Of course, if we assume a classical regression model (when u> = 0.5) 
such as y = /(x) + error, where the error is independent of x with mean zero and constant 
variance, 7 Z(f,a 0 ) then just equals the squared error risk plus a constant. Unfortunately, 
such equivalence breaks down for other values of u and more general models. 

After choosing the risk function, the goal is to minimize the risk. Since typically the 
estimation is done in a function space, the minimization is carried out in the chosen function 
space. In our case, the function space is RKHS generated by a kernel function K. Thus, the 
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ideal risk is defined as 


n 


* 

f,a o 


inf 

/GHx) a o^ 


K(f,a o). 


Consider the kernel expectile regression estimator (/, d 0 ) as defined in ([T]) based on a 
training sample D n , where D n = {(x*, r/,)}” =1 are i.i.d. drawn from an unknown distribution. 
The observed risk of KERE is 


K(f, do) = E^ y ^ u {y - d 0 - /(x)). 

It is desirable to show that 77(/,do) approaches the ideal risk 7 Zf ao . 

It is important to note that 7 Z(f, do) is a random quantity that depends on the training 
sample D n . So it is not the usual risk function which is deterministic. However, we can 
consider the expectation of 77(/,do) and call it expected observed risk. The formal definition 
is given as follows 


Expected observed risk: E Dn TZ(f, d 0 ) = E Dn {Efr, y )<f>u(y - d 0 - /(x))}. (11) 


Our goal is to show that 7 Z(f,a 0 ) converges to Llf ao - We achieve this by showing that 
the expected observed risk converges to the ideal risk, i.e., lim, woo E Dn lZ(f, d 0 ) = Llf ao - By 
definition, we always have 7 Z(f, do) > Then by Markov inequality, for any £ > 0 


P 


(nl 


at oj 


77 


/>“ 0 


> £ < 


7?D n 77(/, d 0 ) - lZ* f 


a o 


-)■ 0 . 


The rigorous statement of our result is as follows: 


Theorem 1. Let M = sup x 77(x, x) 1//2 . Assume M < oo and Ey 2 < D < oo where M and 
D are two constants. If X is chosen such that as n —>• oo, X/n 2E —* oo, X/n —* 0, then we 
have 

EdM/, d 0 ) -»■ 7^} lQo as n oo, 


and hence 


77(/, do) — 77^ ao —> 0 m probability. 


The Gaussian kernel is perhaps the most popular kernel for nonlinear learning. For the 
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Gaussian kernel A'(x, x') = exp(—||x — x'|| 2 /c), we have M — 1. For any radial kernel with 
the form A (x, x') = /i(||x —x'||) where h is a smooth decreasing function, we see M = h( 0)s 
which is finite as long as h( 0) < oo. 


3 Algorithm 

3.1 Derivation 

Majorization-minimization (MM) algorithm is a very successful technique for solving a wide 


range of statistical models (Lange et al. 2000 Hunter and Lange, 2004; Wu and Lange, 2010 


Zhou and Lange, 2010 Lange and Zhou, 2014). In this section, we develop an algorithm 


inspired by MM principle for solving the optimization problem (10). Note that the loss 


function <f> u in (10) does not have the second derivative. We adopt the MM principle to find 


the minimizer by iteratively minimizing a surrogate function that majorizes the objective 


function in (10). 


To further simplify the notation we write ct = (a 0 , aq, a 2 , ■ ■ ■ , ct n ) T , and 


Kj = (1, A'(xj, xi),... ,K (x;,x n )), K = 


/ iF(xi,x 1 ) Ji(xi,x n ) ^ 


y /i(x„,X!) ■■■ if(x„,x n ) j 


Kn = 


0 

Onxl 


0 T 

u nxl 

K 


Then (10) is simplihed to a minimization problem as 

ol = argminFb^a), 

a 

n 

F w ,\ (a) = ^4>uj (Vi ~ Kja) + Aa T K 0 a, 

i =1 


( 12 ) 

(13) 


where cu is given for computing the corresponding level of the conditional expectilc. We also 
assume that A is given for the time being. A smart algorithm for computing the solution for 


a sequence of A will be studied in Section 3.3 



























Our approach is to minimize (12) by iteratively update at using the minimizer of a 


majorization function of Fu,\{cx). Specifically, at the fc-th step of the algorithm, where 
k = 0,1,2,..., assume that is the current value of at at iteration k, we find a majorization 
function Q(at \ ol for F u , i ^(ct) at current a ^ that satishes 


Q(at | at ( ^) >F u1 ,\{ol) when at ^ ct^ k \ 
Q(at | at^) =F u] ,\{ol) when at = at^. 


(14) 

(15) 


Then we update at by minimizing Q(at \ ofi®) rather than the actual objective function 
F u ,\{a): 

a ( k + 1 ) — argminQ(a | ct^). (16) 

OL 

To construct the majorization function Q(at | q^) for Fu,\(ot) at the k -th iteration, we use 
the following lemma: 

Lemma 1 . The expectile loss (f) u has a Lipschitz continuous derivative (j)' u , i.e. 


10M ~ 0L( 6 )l < L \ a ~ b \ Va,6eM, 


(17) 


where L = 2max(l — co,co). This further implies that 0^ has a quadratic upper bound 


fa) < 0^(6) + 0^(6) (a - h) + —\a - b\ 2 Va, b e M. 


(18) 


Note that “=” is taken only when a = b. 


Assume the current “residual” is r\ k ' > — y % — K iaS k \ then it is equivalent in (12) that 
y t — K iOt = r— Ki(a — at^). By lemma [lj we obtain 

10L( r f } - K i(« - « (fc) )) - 0L( r f } )l < 2 max(l — u, cu) |Kj(a - at (k) ) |, 

and the quadratic upper bound 


0 w (rf ) - Ki(ac - ct {k) )) < qfat \ ot [k) ), 
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where 


qi{oL | a (fc) ) = (j) u ( rf ] ) - ^(r-^)Kj(a - a (fc) ) + max(l - a;, a;) (a - a (fc) ) T KjKj(a - a (fc) ). 


Therefore the majorization function of F u ^(ol) can be written as 


Q{a | a (fc) ) = ^2qi(a \ a< fc >) + Aa T K 0 a, 

i =1 


(19) 


which has an alternatively form that can be written as 


Q(a | a<*>) = F w , A (aW) + VF u , A (a«)(a - a^) + (a - a«) T K a (a - a«), (20) 


where 


K u = AK 0 + max(l — cu, oj) K,K| 


i —1 


= max(l — uj,lo) 


n 


K1 KK 


1 T K 

A 

max(l— cj,cj) 


K 


( 21 ) 

( 22 ) 


and 1 is an nxl vector of all ones. Our algorithm updates a. using the minimizer of the 


quadratic majorization function (20): 


Q ,(fc+ 1 ) _ argminQ(a | cx (fc) ) = + K M 1 J -AK 0 o: (fc) + - ^0^(rf ')K, j . (23) 


2—1 


The details of the whole procedures for solving (12) are described in Algorithm [lj 


3.2 Convergence analysis 


Now we provide the convergence analysis of Algorithm [T] Lemma [2] below shows that the se¬ 
quence (a^) in the algorithm converges to the unique global minimum « of the optimization 
problem. 


Lemma 2. If we update aA fc+1 ) by using (23), then the following results hold: 

1. The descent property of the objective function. Fu^a^ 1 ') < F^^ct^), Vk. 
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Algorithm 1 The algorithm for the minimization of (12) 


• Let {yi}i be observations of the response, {K(x,, Xj)}” J=1 be the kernel of all obser¬ 
vations, and ct := (cto, «i, « 2 , • • •, ct n ). 

• Initialize ab 0 -* and k = 0. 

• Iterate step 1-3 until convergence: 

1. Calculated the residue of the response by — y t — K iOt^ for all 1 < i < n. 

2. Obtain a^ +1 ^ by: 

a (M) = a tt) + K-> [ -AK„ a <‘) + t IKi) , 


i =1 


where 


3. k k + 1. 


Tl 1 J K 

K u = max(l -u,u) [ K1 KK + -a- K 

max(l— uj,io) 


2. The convergence of a. Assume that ^” =1 K*KJ is a positive definite matrix, then 
lim^oo ||a (A:+1) - CK (fc) || = 0. 


3. The sequence ( ct converges to ct, which is the unique global minimum of (12) 


Theorem 2. Denote by ct the unique minimizer of (12) and 


At = 


Q(a | o: (fc) ) - Fyjfia) 
ct — o:( fc )) T K a (a — ct^l) 


(24) 


Note that when A k = 0, it is just a trivial case ct i - j) = ct for j > k. We define 


r = i- 7min (K- 1 K / ), 


where 

n 

K, = AK 0 + min(l - u, u) ^ K,K[. 

i= 1 

Assume that Y^i=i is a positive definite matrix. Then we have the following results: 

1. F WiA (a( fc+1 )) - F UtX (a) < A k (^(a^) - i^a)) . 
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2. The sequence {F^^ol^)) has a linear convergence rate no greater than T, and 0 < 

Afc < r < i. 

3. The sequence (a®) has a linear convergence rate no greater than A/r 7 max (K tt )/ 7 min (K/), 

i.e. 

y 'Tmin(K/J 

Theorem [2] says that the convergence rate of Algorithm [T| is at least linear. In our numeric 
experiments, we have found that Algorithm [T| converges very fast: the convergence criterion 
is usually met after 15 iterations. 


3.3 Implementation 

We discuss some techniques used in our implementation to further improve the computational 
speed of the algorithm. 

Usually expectilc models are computed by applying Algorithm [T| on a descending sequence 
of A values. To create a sequence {A m }^f =1 , we place M — 2 points uniformly (in the log-scale) 
between the starting and ending point A max and A min such that the A sequence length is M. 
The default number for M is 100, hence Ai = A max , and Ai 0 o = A min . We adopt the warm- 
start trick to implement the solution paths along A values: suppose that we have already 
obtained the solution 6t\ m at A m , then will be used as the initial value for computing 
the solution at A m+ i in Algorithm [lj 

Another computational trick adopted is based on the fact that in Algorithm [lj the inverse 
of K u does not have to be re-calculated for each A. There is an easy way to update K" 1 for 
Ai, A 2 , • •.. Because K u can be partitioned into two rows and two columns of submatrices, 


by Theorem 8.5.11 of Harvillc (2008), can be expressed as 


K, (A) = - jz -r 

max(l — u,uj) 


max(l — u,u) 


-i 


n 1 T K 

K1 KK A 


max(l— uj , uj ) 


K 



) + l 

' -^ 1TK ] 

/ ' 

( In ) 


Qa(—K1,I, 

n 


(25) 
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where 


Qa 1 = 


KK 


A 


max(l — oj, oj) 


K 


n -1 


—K11 T K 

n 


In (25) only Q A X changes with A, therefore the computation of Kjj 1 for a different A only re¬ 
quires the updating of Q A k Observing that Q A * is the inverse of the sum of two submatrices 
A and B: 


A>=KK 


A 


max(l — oj, oj) 


K. B =-K11 T K. 

n 


By Sherman-Morrison formula (Sherman and Morrison, 1950), 


Qt = [A, + B]- 1 = A, 1 - — A a -‘BA a - 


(26) 


where g = trace(BA A 1 ), we find that to get Q A * for a different A one just needs to get A a x , 
which can be efficiently computed by using eigen-decomposition K = UDU T : 


A a x = ( KK 


A 


max(l — oj, oj) 


-l 


K = U D 


A 


max(l — oj, oj) 


-l 

I„ ) U T . 


(27) 


(27) implies that the computation of K U 1 (A) depends only on A, D, U and uj. Since D, U 


and oj stay unchanged, we only need to calculate them once. To get K, U 1 (A) for a different 


A in the sequence, we just need to plug in a new A in (27) 


The following is the implementation for computing KERE for a sequence of A values 
using Algorithm 1: 

• Calculate U and D according to K = UDU T . 

• Initialize ct\ 0 = [0, 0,..., 0]. 

• for m — 1,2,, M, repeat step 1-3: 

1. Initialize a l 0 '* = 6t\ 


2. Compute K u 1 (A m ) using (25), (26) and (27). 

3. Call Algorithm [l] to compute ct\ m . 

Our algorithm has been implemented in an official R package KERE, which is publicly 
available from the Comprehensive R Archive Network at http://cran.r-project.org/ 
web/packages/KERE/index.html. 
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4 Simulation 


In this section, we conduct extensive simulations to show the excellent finite performance 
of KERE. We investigate how the performance of KERE is affected by various model and 
error distribution settings, training sample sizes and other characteristics. Although many 
available, throughout this section we use the commonly recommended (Hastie 

-llxi-Xjll 2 

) Gaussian radial basis function (RBF) kernel ih(xj,Xj) = e , We select 

the best pair of kernel bandwidth a 2 and regularization parameter A by two-dimensional 
five-fold cross-validation. All computations were done on an Intel Core i7-3770 processor at 
3.40GHz. 


kerne 


et al. 


s are 


2009 


Simulation I: single covariate case 

The model used for this simulation is defined as 


. / n 7 \, X i> \ x i\ + 1 
Vi = sm(0.7xj) + — H--— €i, 


(28) 


which is heteroscedastic as error depends on a single covariate x ~ U[— 8,8]. We used a 
single covariate such that the estimator can be visualized nicely. 

We used two different error distributions: Laplace distribution and a mixed normal dis¬ 
tribution, 

e i ~ 0.5IV(0, -) + 0.5iV(l, —). 

4 lo 


We generated n = 400 training observations from (28), on which five expectile models 
with levels u> = {0.05,0.2,0.5,0.8,0.95} were fitted. We selected the best (a 2 , A) pair by 
using two-dimensional, five-fold cross-validation. We generated an additional n' = 2000 
test observations for evaluating the mean absolute deviation (MAD) of the final estimate. 
Assume that the true expectile function is f w and the predicted expectile is then the 
mean absolute deviation are defined a 


1 


MAD(ca) = — l/«(x») - /w(xi 


i =1 


The true expectile f u is equal to sin(0.7.x') + |^ + ^g^6 w (e), where 6 w (e) is the cu-expectile of 
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Theoretical Expectiles 


Empirical Expectiles 



Figure 1: Theoretical expectiles and empirical expectiles for a covariate heteroscedas- 
tic model with mixed normal error. The model is fitted on five expectile levels uj = 
{0.05,0.2,0.5,0.8,0.95}. 

e, which is the theoretical minimizer of E(f> u (e — b ). 

The simulations were repeated for 100 times under the above settings. We recorded 
MADs for different expectile levels in Table [l] We find that the accuracy of the expectile 
prediction with mixed normal errors is generally better than that with Laplace errors. For 
the symmetric Laplace case, the prediction MADs are also symmetric around to = 0.5, while 
for the skewed mixed-normal case the MADs are skewed. In order to show that KERE 
works as expected, in Figure [l] we also compared the theoretical and predicted expectile 
curves based on KERE with uj = {0.05,0.2,0.5,0.8,0.95} in Figure [lj We can see that the 
corresponding theoretical and predicted curves are very close. Theoretically the two should 
become the same curves as n —> oo. 
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IX 

0.05 

0.2 

0.5 

0.8 

0.95 

Mixture 

Laplace 

0.236 (0.003) 
2.346 (0.013) 

0.138 (0.003) 
1.037 (0.007) 

0.376 (0.002) 
0.179 (0.005) 

0.610 (0.002) 
1.033 (0.006) 

0.788 (0.002) 
2.333 (0.027) 


Table 1: The averaged MADs and the corresponding standard errors of expectile regression 
predictions for single covariate heteroscedastic models with mixed normal and Laplace error. 
The models are fitted on five expectile levels co = {0.05,0.2,0.5,0.8,0.95}. The results are 
based on 300 independent runs. 

Simulation II: multiple covariate case 

In this part we illustrate that KERE can work very well for target functions that are non¬ 
additive and/or with complex interactions. We generated data {xj,?/j}” =1 according to 


Vi = /i(x*) + |/ 2 (x;)|ei, 


where predictors x, ; was generated from a joint normal distribution iV(0,I p ) with p = 10. 
For the error term e* we consider three types of distributions: 

1. Normal distribution e 

i r ^ 1 N( 0,1). 

2. Student’s f-distribution with four degrees of freedom ~ f 4 . 

3. Mixed normal distribution e* ~ 0.91V(0,1) + 0.11V(1,4). 


We now describe the construction of f\ and / 2 . In the homoscedastic model, we let 


/ 2 (xj) = 1 and fi is generated by the “random function generator” model (Friedman 2000), 
according to 

20 

/( x ) = 


i=i 


where {a;}/£ 1 are sampled from uniform distribution a/ ~ U[— 1,1], and X/ is a random subset 
of p-dimensional predictor x, with size pi = min({1.5 + r, pj), where r was sampled from 
exponential distribution r ~ Exp( 0.5). The function gi(xi) is an p;-dimensional Gaussian 
function: 


gi(xi) = exp 


2 ( x ^ - Mi) Ty /( x /-M/) 
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where n l follows the distribution N(Q, I Pf ). The pi x pi covariance matrix V; is defined by 
V/ = U}D/U[, where U) is a random orthogonal matrix, and D; = diag(dn, d 2 /, ■ ■ ■ ,d Pl i ) 
with \fdji ~ U[ 0.1,2], 

In the heteroscedastic model, fi is the same as in the homoscedastic model and / 2 is 
independently generated by the “random function generator” model. 

We generated n = 300 observations as the training set, on which the estimated expectilc 
functions f w were computed at seven levels: 

u G {0.05,0.1,0.25,0.5,0.75,0.9,0.95}. 

An additional test set with n' = 1200 observations was generated for evaluating MADs 
between the fitted expectilc f u and the true expectilc f u . Note that the expectile function 
/ w (x) is equal to /i(x) + b u (e) in the homoscedastic model and /i(x) + |/ 2 (x)|6 a; (e) in the 
heteroscedastic model, where b u (e) is the cn-expectile of the error distribution. Under the 
above settings, we repeated the simulations for 300 times and record the MAD and timing 
each time. 

In Figure [2] and [3] we show the box-plots of empirical distributions of MADs, and in 
Table [2] we report the average values of MADs and corresponding standard errors. We see 
that KERE can deliver accurate expectile prediction results in all cases, although relatively 
the prediction error is more volatile in the heteroscedastic case as expected: in the mean 
regression case (u = 0.5), the averaged MADs in homoscedastic and heteroscedastic models 
are very close. But this difference grows larger as c o moves away from 0.5. We also observe 
that the prediction MADs for symmetric distributions, normal and t 4 , also appear to be 
symmetric around the conditional mean oj = 0.5, and that the prediction MADs in the 
skewed mixed-normal distribution cases are asymmetric. The total computation times for 
conducting two-dimensional, five-fold cross-validation and fitting the final model with the 
chosen parameters (cr 2 , A) for conditional expectiles are also reported in Table [3] We find 
that the algorithm can efficiently solve all models under 20 seconds, regardless of choices of 
error distributions. 

We next study how sample size affects predictive performance and computational time. 
We fit expectile models with lu G {0.1, 0.5, 0.9} using various sizes of training sets n G {250, 
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Figure 2: Homoscedastic models with error distribution (a) normal, (b) t 4 distribution, 
(c) mixed normal. Box-plots show MADs based on 300 independent runs for expectiles 
u E {0.05,0.1,0.25,0.5,0.75,0.9,0.95}. 



CO 


Figure 3: Heteroscedastic models with error distribution (a) normal, (b) t 4 distribution, 
(c) mixed normal. Box-plots show MADs based on 300 independent runs for expectiles 
u E {0.05,0.1,0.25,0.5,0.75,0.9,0.95}. 
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Homoscedastic model Heteroscedastic model 


u 

Normal 

t-4 

Mixture 

Normal 

U 

Mixture 

0.05 

0.4068 

0.4916 

0.4183 

0.6009 

0.8035 

0.6142 


(0.0039) 

(0.0061) 

(0.0046) 

(0.0079) 

(0.0103) 

(0.0066) 

0.1 

0.3975 

0.4529 

0.4019 

0.5067 

0.6315 

0.5052 


(0.0037) 

(0.0051) 

(0.0037) 

(0.0056) 

(0.0094) 

(0.0054) 

0.25 

0.3717 

0.4145 

0.3886 

0.4065 

0.4648 

0.4173 


(0.0031) 

(0.0042) 

(0.0038) 

(0.0042) 

(0.0061) 

(0.0047) 

0.5 

0.3750 

0.4069 

0.3851 

0.3712 

0.4038 

0.3886 


(0.0032) 

(0.0038) 

(0.0032) 

(0.0042) 

(0.0049) 

(0.0045) 

0.75 

0.3782 

0.4261 

0.4102 

0.4185 

0.4702 

0.4635 


(0.0033) 

(0.0042) 

(0.0036) 

(0.0046) 

(0.0064) 

(0.0057) 

0.9 

0.3932 

0.4553 

0.4356 

0.4968 

0.6226 

0.6203 


(0.0038) 

(0.0050) 

(0.0045) 

(0.0058) 

(0.0081) 

(0.0076) 

0.95 

0.4040 

0.4925 

0.4628 

0.5938 

0.8078 

0.7631 


(0.0046) 

(0.0062) 

(0.0054) 

(0.0066) 

(0.0128) 

(0.0102) 


Table 2: The averaged MADs and the corresponding standard errors for fitting homoscedastic 
and heteroscedastic models based on 300 independent runs. The expectilc levels are u E 
{0.05, 0.1, 0.25, 0.5, 0.75,0.9, 0.95}. 


UJ 

Homoscedastic model 

Heteroscedastic model 

Normal 

U 

Mixture 

Normal 

U 

Mixture 

0.05 

19.04 

21.47 

17.10 

16.90 

17.60 

17.95 

0.1 

14.25 

16.89 

13.91 

14.38 

14.60 

15.21 

0.25 

11.67 

15.25 

13.59 

12.30 

12.49 

12.36 

0.5 

10.54 

14.09 

12.18 

10.92 

11.13 

11.01 

0.75 

8.24 

15.33 

10.47 

12.48 

12.48 

12.38 

0.9 

10.08 

14.39 

12.46 

14.67 

15.25 

14.52 

0.95 

12.16 

19.90 

15.17 

17.34 

17.75 

16.61 


Table 3: The averaged computation times (in seconds) for fitting homoscedastic and 
heteroscedastic models based on 300 independent runs. The expectilc levels are oj G 
{0.05, 0.1, 0.25, 0.5, 0.75,0.9, 0.95}. 
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n 


Error 



Timing 


250 

500 

750 

1000 

250 

500 

750 

1000 

uj = 0.1 

0.4824 

0.4084 

0.4047 

0.3887 

8.739 

56.188 

168.636 

382.897 

u = 0.5 

0.3329 

0.2977 

0.2732 

0.2544 

6.028 

43.802 

159.398 

329.646 

uj = 0.9 

0.6341 

0.5861 

0.5563 

0.5059 

9.167 

56.533 

173.359 

386.345 


Tabic 4: The averaged MADs and the corresponding averaged computation times (in sec¬ 
onds) are reported. The size of the training set varies from 250 to 1000. The size of the test 
data set is 2000. All models are fitted on three expectilc levels: (a) uj — 0.1, (b) uj = 0.5 and 
(c) uj = 0.9. 

500, 750, 1000} and evaluate the prediction accuracy of the estimate using an independent 
test set of size n' = 2000. We then report the averaged MADs and the corresponding 
averaged timings in Tabic [4j Since the results are very close for different model settings, 
only the result from the heteroscedastic model with mixed-normal error is presented. We find 
that the sample size strongly affects predictive performance and timings: large samples give 
models with higher predictive accuracy at the expense of computational cost - the timings 
as least quadruple as one doubles sample size. 


5 Real data application 


In this section we illustrate KERE by applying it to the Personal Computer Price Data 


studied in Stengos and Zacharias (2006). The data collected from the PC Magazine from 


January of 1993 to November of 1995 has 6259 observations, each of which consists of the 
advertised price and features of personal computers sold in United States. There are 9 main 
price detriments of PCs summarized in Table [5] The price and the continuous variables 
except the time trend are in logarithmic scale. We consider a hedonic analysis, where the 
price of a product is considered to be a function of the implicit prices of its various compo¬ 


nents, see Triplett (1989). The intertemporal effect of the implicit PC-component prices is 


captured by incorporating the time trend as one of the explanatory variables. The presence 
of non-linearity and the interactions of the components with the time trend in the data, 


shown by Stengos and Zacharias (2006), suggest that the linear expectile regression may 


lead to a misspecified model. Since there lacks of a general theory about any particular 
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ID 

Variable 

Explanation 

1 

SPEED 

clock speed in MHz 

2 

HD 

size of hard drive in MB 

3 

RAM 

size of RAM in in MB 

4 

SCREEN 

size of screen in inches 

5 

CD 

if a CD-ROM present 

6 

PREMIUM 

if the manufacturer was a “premium” firm (IBM, COMPAQ) 

7 

MULTI 

if a multimedia kit (speakers, sound card) included 

8 

ADS 

number of 486 price listings for each month 

9 

TREND 

time trend indicating month starting from Jan. 1993 to Nov. 1995 


Table 5: Explanatory variables in the Personal Computer Price Data (Stengos and Zacharias 


2006) 


functional form for the PC prices, we use KERE to capture the nonlinear effects and higher 
order interactions of characteristics on price and avoid severe model misspecification. 

We randomly sampled 1/10 observations for training and tuning with two-dimensional 
five-fold cross-validation for selecting an optimal (a 2 , A) pair, and the remaining 9/10 obser¬ 
vations as the test set for calculating the prediction error defined by 

1 ' 

prediction error = — Y]<f> u {yi - A,(xQ). 
n 

i =1 

For comparison, we also computed the prediction errors using the linear expectilc regres¬ 
sion models under the same setting. All prediction errors are computed for seven expectilc 
levels l o G {0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}. We repeated this process 100 times and 
reported the average prediction error and their corresponding standard errors in Table [6] 
We also showed box-plots of empirical distributions of prediction errors in Figure [4j We see 
that for all expectilc levels KERE outperforms the linear expectile model in terms of both 
prediction error and the corresponding standard errors. This shows that KERE offers much 
more flexible and accurate predictions than the linear model by guarding against model 
misspecification bias. 
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Personal Computer Price Data 

U) 

0.05 

0.1 

0.25 

0.5 

0.75 

0.9 

0.95 

Linear 

5.727 

3.396 

5.722 

7.078 

6.032 

3.814 

2.517 


(0.013) 

(0.010) 

(0.015) 

(0.017) 

(0.015) 

(0.014) 

(0.012) 

KERE 

3.970 

2.523 

3.952 

4.749 

4.094 

2.684 

1.868 


(0.013) 

(0.010) 

(0.015) 

(0.017) 

(0.015) 

(0.014) 

(0.012) 


Table 6: The averaged prediction error and the corresponding standard errors for the 
Personal Computer Price Data based on 100 independent runs. The expectilc levels are 
u G {0.05, 0.1, 0.25, 0.5, 0.75,0.9, 0.95}. The numbers in this table are of the order of 10~ 3 . 


(o = 0.05 


co = 0.75 


(0 = 0.1 


KERE Linear 


ft) = 0.9 


KERE Linear 


ft) = 0.25 


KERE Linear 


co = 0.95 


KERE Linear 


(0 = 0.5 



KERE Linear 


Figure 4: Prediction error distributions for the Personal Computer Price Data using the 
linear expectilc model and KERE. Box-plots show prediction error based on 100 independent 
runs for expectiles u G {0.05,0.1,0.25,0.5,0.75,0.9,0.95}. The numbers in this table are of 
the order of 10 -3 . 
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Appendix: Technical Proofs 


5.1 Some technical lemmas for Theorem |T| 

We first present some technical lemmas and their proofs. These lemmas are used to prove 
Theorem [l] 


Lemma 3. Let 0* be the convex conjugate of 


= 


hj)t 2 ift<0, 


4(1 


l ft> o. 


The solution to (10) can be alternatively obtained by solving the optimization problem 


min g(ai, a 2 ,..., a n ), subject to > a* = 0, 

<»>?- ti 


(29) 


where g is defined by 


g(ai,a 2 ,...,a n ) = - S ^2 / y i a i + - ^ a i a j K(x i ,^ j ) + 2A ^ </>*(«*). (30) 


2—1 


i,j= 1 


i=l 


Proof. Let a = (cti, a 2 , ■ ■ ■, a n ) T . Since both objective functions in (10) and (29) are convex, 
we only need to show that they share a common stationary point. Define 


G w {ot) — (f>uj(a i) + <f u (a 2 ) + • • • + (f>u{oi n ), 
VG u (a) = (0L(«l), 0L(«2), • • •, 0L(«n)) T - 


By setting the derivatives of (10) with respect to ct to be zero, we can find the stationary 


point of (10) satisfying 


_d_ 

da: 



f yi - «o ^ 



0L(l/i - «o - E"=i K(xt, Xj)a 3 ) 


«£5 

to 

... | 

s 

0 

- Ka 


~ «o - E”=i K ( x * x i) a i) 


^ Un j 



4*w(yn Ej=l ^ { X nt x j)®-j) 


+ A —cP Kol = 0, 
do: 
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which can be reduced to 


4>' u (yi - Q 0 - ^ K(xi, Xj)oij) + 2Actj = 0, for 1 < i < n, 

3 =1 


and setting the derivative of (10) with respect to «o to be zero, we have 


XX (k - «0 - = 0. 


2—1 


3 = 1 


Combining (31) and (32), (32) can be simplified to 


= 0 - 


2—1 


In comparison, the Lagrange function of (29) is 


g(ai, « 2 , ■ ■ ■, Oi n ) + v 




2=1 


The first order conditions of (34) are 


Vi + v + ^ if (a;*, Xj)aj + 2A0* , (cp) = 0, for 1 < i < n, 
l=i 


and 


yy a. = 0. 

2=1 

Noting that 2A0* , (cp) = 0*'(2Acp) and q 6*' is the inverse function of <//,. Let ^ 


(31) and (35) are equivalent. Therefore, (10) and (29) have a common stationary 


therefore a common minimizer. 


Lemma 4. 


yy Qj-if (Xj,Xj) < y/K(Xi,Xi 
1=1 


N 


yyyy aja ,/f(x t .x ) 

*=1 1=1 


(31) 


(32) 


(33) 


(34) 


(35) 

(36) 

= ao, then 
r point and 

□ 
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Proof. Let C = K 1//2 , then by Cauchy-Schwarz inequality 


^ ^ OijK (x^, Xj) (^1? ^2? • • • 5 ^ 72 )C(C^l, C ij2 , • • • ? C^n) 
j=l 


— II (^1? ^2? • • • ? ||*|| ^z,2? • • • ? ^i,r, 


, EE OliOijKfei, Xj) • y/K(Xi,Xi). 

\ i= 1 1=1 


□ 


Lemma 5. For the g function defined in (30), we /mve 


^(ai - a*) (ay - ay) A' (xj,Xj) + 


A 


_! max(l — tu, w) ^ 

i,j=i v ’ ’ i=i 

<g(ai, a 2 , ■ ■ ■, «„) - g(&i,a 2 , • • •, a n ) 




1 

<- 

“2 


A 


E (cij Qiifio'j CXj)K (Xj, Xj) "A . , > ^ '(^i bj) • 

2 mm 1 — a;,a;) 

; ,-i V ’ ' i=l 


i,j =1 


Proof. It is clear that the second derivative of q is bounded above by K H—-H and 

bounded below by K + ^ I, where K e R n ’”. Let ol = (ai,a 2 ,..., a n ) J 





1 

A 



g(oi) 

-0(a) 

<e 

1 

1— 

<e 

VI 

+ X 

(K + 

T)(a — a ) 1 (ck — a), 

(37) 




2 

nnn(l — 00 , u) 



d(oc) 

-0(a) 

<s 

1 

1— 

s 

"O; 

Al 

1 

+ x 1 

(K +- jf- -; 

-I)(a — S) T (a — S). 

(38) 




2 

maxfl — c j, oj 




Hence when ot and ci are fixed and gfiot) = 0, the maximum of g(ot) — g(6t) is obtained 
when the second order derivative of g achieves its maximum and the minimum is obtained 
when the second order derivative achieves its minimum. □ 


The next lemma establishes the basis for the so-called leave-one-out analysis (Jaakkola 


and Haussler[ 1999 Joachims 2000 Forster and Warmuth 2002; Zhang 2003). The basic 


idea is that the expected observed risk is equivalent to the expected leave-one-out error. Let 
D n+ 1 = {(xj, be a random sample of size n + 1, and let D ^ + , be the subset of D n+ 1 

with the i-th observation removed, i.e. 



{( x l; • • ) ( x i— 1 1 Vi—l)i ( x *+l j Vi+ 1) j ■ • • j ( x n+l; 2/n+l)}• 
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Let (/H , d^) be the estimator trained on D^ +1 . The leave-one-out error is defined as the 
averaged prediction error on each observation (x, ; , yf) using the estimator (/+ ) computed 

from D^ +1 , where (x, ; , yf) is excluded: 


Leave-one-out error: 


1 

n + 1 


n+1 

Y faiy* ~ “o 1 

i =1 


/'+ i)). 


Lemma 6. Let (/(n),do(n)) &e t/ie KERE estimator trained from D n . The expected observed 
risk E^E^y'jCpufy — d 0 (n) — /(n)( x )) zs equivalent to the expected leave-one-out error on 
Dn -\-1 • 


n+1 


^D n {^(x,y)0a;(Z/ - «0(n) ~ /(n)( x ))} = E Dn+1 “ 4'’ “ / W ( X +) , (39) 

2=1 


where 6$ and /M are KERE trained from D„. 


n+l ‘ 


Proof. 


n+l 


n+l 


J D n+ 1 


(y+y X] ^ Vi ~ “o 1 - / w ( x *))) = ^+1 E D n+ Mvi ~ 4 1 - / w ( x i)) 
2=1 2=1 


2=1 

n+l 


Y E dM { E {^,yiMyi ~ 4 ] - / [ll ( x *))} 


n+l Z ^n + l 

i= 1 
t n+l 


n + l 


5^-S£» n {-S(x,y)0 W (j/ - «0 - /( x ))} 


2=1 


= E Dn E {Xty) (j) u (y -d 0 - /(x)). 


0 


In the following Lemma, we give an upper bound of |d,| for 1 < i < n. 

Lemma 7. Assume M = sup x K (x,x) 1 / 2 . Denote as (/( n ),do(n)) t/ie KERE estimator 
in ([7]) trained on n samples D n = {(x*, i/j)}” =1 . The estimates d*( n ) for 1 < i < n are 
defined by /(„)(■) = EIU d i ( n )A'(x i , •). Denote ||F n || 2 = ^YTi=iVh = p E"=i \Vi\> 
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_ maxiiu^uj} _ max n — u u \ w e claim that 

^ mm(l— uj , uj ) 7 v ’ / 


«i(n)| < y + m 0a + !) Jypnlh + |j/i|), f°rl<i< 


n. 


Proof. The proof is as follows. The function g is defined as in (30), then 


(40) 


Q^P ^-1 (n) 5 ^2 (n) ? • • • i (n)) — O5 


we have 


^ ^ (ri)^j (n)-K(p^-ii ^ ^ Vi^i (n) ^A ^ ^ (n)) 


M=1 


2=1 


2=1 


A 


n 

^ 2=1 


92 


92 \ ' 2 

2A& 

2=1 


< * 
“ 2A 




2=1 


Applying Lemma [4j we have 


n 


f (n) (Xj) = 2^d i(n )/t (Xi,Xj) < M 
3 =1 


E n o 

i=i y 


' 92 , 


A =MW^||r n || 2 . 


(41) 


By the definition in (10), do(n) is given by argmin ao X)" =1 4 >uj(yi — a o — /( n )(x, ; )). By the first 
order condition 

n 

^ 2 \u - I(Vi~ d 0 (n) - /(n)(Xi))|(Z/i - «0(n) “ /(n)(x*)) = 0 . 

2=1 

Let q = |tu — /(y, — d 0 (n) — /(n)( x i))|, we have min(l — cu,cn) < q < max(l — hence 


Z Ci )“°( 


(n) 


2=1 


< 


| ^Ci(yi - /(n)(Xi)) < Z Ci (M + |/hh( X 0|) 
2=1 2=1 
n I - 

92 (Z |y*| + nA/Wyll^Hs), 


2=1 
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and we have 


(42) 


|do(n)| < Ql 


\\Yr 


n 1 


n 


Combining (31) and (42), we concluded (40). 



□ 


5.2 Proof of Theorem [l] 

Proof. Consider n + 1 training samples D n+ 1 = {(xx, yi), ..., (x n+1 , y n+1 )}. Denote as 
(/W,do^) the KERE estimator trained from D^ + x , which is a subset of D n+ \ with i-th ob¬ 
servation removed, i.e., 



{ ( x l) 2/l) j ■ ■ • j ( X i~ 1) Vi— 1)> ( x i+l ; Vi+1 )■>■■■ i ( x n+l; 2/n+l)}- 


Denote as (/( n+ i), d 0 (n+i)) the KERE estimator trained from n + 1 samples D 
estimates dj for 1 < i < n + 1 are defined by /( n+1 )(-) = djA'(xj, •). 

In what follows, we denote \\Y n+ i\\ 2 = ^i' 1 = ^ 9i = 

q 2 = max(l — uj,cu), q 3 = min(l — uj,uj). 


n+ 1- The 


max(l-u,w) 
min(l—’ 


Part I We first show that the leave-one-out estimate is sufficiently close to the estimate 
fitted from using all the training data. Without loss of generality, just consider the case that 
the (n + l)th data point is removed. The same results apply to the other leave-one out cases. 
We show that |/[” +1 ](xj) + do n+1 ^ — /( n+ i)(xj) — do(n+i)l < C^ +1 \ where the expression of 
Cf +1] is to be derived in the following. 

We first study the upper bound for |/[ n+1 ]( X j) — /( n + 1 )(xj)|. By the deffiiitions of g in 
( |30j ) and (dj n+1 ', 6tfff +1 \ ..., b|r +1 ^), we have 


9 (d[" +11 ,4 n+11 . 

/ ~ [n+1] ~ [n+1] 

= g{(Xi ,«2 , 


,dir +11 ,o) 

n+l]\ 

1 ) 


, a 


1 . „ 1 , . 1 , X 

9[oil H—CR+i) oli H—cx n+ i,..., ol u -t— oi n+ 1 ) 
n n n 

. 1 ^ 1 ^ 1 ^ . 

9 { a i H— a n+1 ) Qg H— oi n+ 1,..., a n -t— a n+ i, 0). 

v n n n 


< 









That is, 


/ a [n+1] a [n+1] ~ [n+ll ^ \ 

g{a[ \af, \ ..., < J , Oj - g{a u a 2 , ..., a n+1 ) 

, 1 „ A 1 „ „ 1 A „ N 

< glcxi H— a n+ 1, «2 H— a n+ 1,..., a n H—ct n +i> 0 ) — fl'( Q; i) a 2, ■ ■ ■, a n+ i ) • 
n n n 

Denote for simplicity that = 0. Applying Lemma [ 5 ] to both LHS and RHS of the 

above inequality, we have 


n+1 


i,j =1 


n+1] 


A 


n+1 


«*)(«; - +)A'(x l ,x j ) + 

^ 2 i=l 


n+1] 




<«;+i 


1 ‘,-ikfl, ...,l,- 1 f+ A <" + 1 > 

n ’ n J \n’ ’ n' J 2 nq$ 


where K G M n+1 > n+1 is defined by K, :j = A"(x, : , x+ Since |AT (xj,Xj)| < M 2 for any 1 < 
i,j <n + l, we have 


, — 1)K( 1 

n ' ' n' 1 V n' ’ n ' 

— _L V" TV _ I V n TV _ i k 1 TV 

— n 2 Z^ij=l xv *,j n Z^i=1 xv *,n+1 n Z~/j=l xv n+l,j T xv n+ i in _(_i 

< M 2 + M 2 + M 2 + M 2 = 4M 2 . 


Combining it with the bound for |d n +i| by Lemma [7] (note that here a n+ i is trained on n + 1 
samples), we have 


where 


n+1 


^(df +i] - +)(dj n+1] - aj)K(xi,Xj) < Cf +1] , 


i,j= 1 


(43) 


cr il = [ 4M 2 + A+±l) | cr . 


(44) 


and 


CcT +1 ' — 9i~—r+r - + M(qi + l) \[^r\\Y n+ i\\ 2 + \y n +i\- 

n + 1 V A 


(45) 
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Combining (43) with Lemma [4j we have that for 1 < i < n + 1, 

n+1 , - 

l/ >+ 1 | W-/(-.+i ) WI = |E(“!” +I1 -“<) ii '( x ‘> x j) i \lcf*' ] M. (46) 


3 =1 


Next, we bound |dg n+1 ^ — do (n+i)|- Since do(n+i) and are the minimizers of 


n+1 

y + 

2=1 

n 

(y* - «o - f(n+i) (xj)) and <+ 

i=l 

+ - «o - / [n+1] (x0), 


we have 

d n+1 

^ ^ «0 /(n+l)(xj)^ 

^ 2=1 

= 0, 

«0=«0(n+l) 

(47) 

and 

i n 

2=1 

~ [n+1] ^ 

ao=&o 

(48) 

By the Lipschitz continuity of (f>' u we have 



£«&( 

y* - d 0 (n+i) - / [n+1] ( x i)) - Eri 1 

E (y< do(n+l) /(n+l)(Xj)^ 



< 2 (n + l)g 2 |/ [n+1] ( x *) - f{n+i) (x*)|, 
and by applying (46) and (47) we have the upper bound 

< 2(n + l)g 2 ^/cf +1] M. 


n+1 


2—1 


0L (?/i - «0(n+l) - / [n+1] (Xi)) 
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Similarly, by (41), (42), and (48) we have 


< 


EILl (yi - «0(n+l) - / [n+1] (Xi)) 

Eri 1 (yi - «0(n+l) - / [n+1] (Xi)) - ES (y* - «0(n+l) - /(n+l)( x 0) 
-tfw (Vn +1 - do(n+l) - / [n+ 1 ] (x„+l)) 

E2? <E {yi - « 0 (n+l) - / [n+1] (x,)) - EIL 4 ! 1 (y* - ^(n+l) - /(n+l)(Xi)) 


+ 


(49) 


0L (yn+l - «0(n+l) - / [n+1] (x n+1 )) 

< 2(n + V)q 2 \j C\ + ^ M + 2g 2 (|yn+i| + |«o(re+i)| + |/(n)|) 

< 2(n + l)q 2 \Jc[ + ' M + 2g 2 (|yn+i| + + -^?i\/^-||^+i|| 2 + V^Fll ^Ih) 

< 2(n + l)g 2 ^/cf +1] M + 2g 2 Cf +1] , 


where the second last inequality follows from (41) and (42). Note that in this case the 
corresponding sample is n + 1. 


Using (48) we have 


< 


2nq 3 \a l Q + - d 0 (n+i)| 

n n 

| Y ^ { Vi ~ «o n+1] _ / [ " + 1 I (x*)) - Y ( Vi - «0(n+i) - / [n+ 1 ] (x,:)) 


Y { Vi - “0("+1) - / [ ” + 1 ] (x ? :)) 

i— 1 


By (49), we have 


l«!T 1] — «0(n+l)| < 9l^(l + “)'\/ < 


+ -) \/c'l n+11 M + -U ^ 11 

n v n 


(50) 


Finally, combining (46) and (50) we have 


l/ [ ” +1| (x,) + 4” + " - /(n+llfXi) - Q0(„+1)| < Cf +I1 , 


(51) 
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where 


C? +1] = Ql f(l + Vcf +1] M + -Cf +1] ) + Jc [ ? +1] M. (52) 

V n v n / v 


Part II We now use ( |5T| ) to derive a bound for cf) u (y n+ 1 — d[, n+1 ^ — +W (x n+] )). Let 
t = /(n+i)(xi) +d 0 (n+i) - / [n+1I (xj) - df +l1 and t' = y t - d 0 (n+i) - /(n+i)( x + We claim that, 


<j>w(t + t ') - </><+') < q 2 (\2tt'\ + |f 2 |). 


(53) 


when (t+t 1 ) and t' are both positive or both negative, (53) follows from (t+t') 2 —t' 2 = 2 tt'+t 2 . 


When t + t' and t' have different signs, it must be that \t'\ < |t|, and we have \t\ = \t + t'\ + \t'\ 


and hence |i + £'| < \t\. Then (53) is proved by <+ (t + t 1 ) — = max(0 w (t + 1 1 ), <j) u (t!)) < 

q 2 max((f + t') 2 , t' 2 ) < max(l —u,uj)t 2 < max(l — u,u)(\2tt'\ + |t 2 |). 

Hence by (|5T|), (53) and the upper bound of \y n +i — /(n+i)( x n+i) — b 0 (n+i)|, we have 


MVn+l ~ 4” +11 ~ / [ ” +1| (x„+,)) < MVn+1 ~ “0(n+l) ~ /(n+l) (x>i+l)) + C+ 1] , 


(54) 


where 


<+ +11 = <j 2 (2<+ +1| <+ +11 + (<+ +11 ) 2 ). 


(55) 


Note that (54) and (55) hold for other i, 1 < i < n + 1. 


<f>u(Vi - ®0 ] - / W ( X i)) < <f>u(Vi - do(n+l) - f(n+l) ( x i)) + C. 


3 • 


(56) 


Hence by (44), (45), (52) and (54) we have 


E 


D n+1 (<+ + - «o ] - / W ( x +) < E D n+ 1 (+(?/i - do (i) - /(n+i)( x +) + E Dn+1 Cf. (57) 


and 


n+l 


< 


n + l 
1 

n + l 


E Dn+1 ( Mi ~ + _ / W ( x i))) 


2=1 

n+l 


Ep n+1 f ^ ^ fiujyi ^0(n+l) /(n+l)( x 2))^ 

2=1 


n+l 


n + l 


E+- < 58 ) 


Ed u+ 1 / , '-'3 


2=1 
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On the other hand, let (/*, 0 +) in the RKHS and satisfy 7 Z(f*, 0 +) < inf/gH ^,oosk 7£(/, ct 0 ) + 
£. From the definition of a 0 ( n +i) and f( n +i) we have 


n+1 


< 


f ^ ^ + + do (n+1) /(n+l)( x i))^ ~b ^ _|_ jJI/(n+l)| 

i=l 

n+1 . 

n + 1 {J2^ - “oe - //( x /») + ^qryll/s 


n + 1 
1 


2 

H K 


p* || 2 

l£ IIHk- 


(59) 


2—1 


By Lemma |6j ([58]) and (59), we get 


^'D n {-E'(x,y) ( t > cd{y do (n) /(n)( x ))} 

. n+1 

-£d„ + 1 ( ^ - df 1 - / W (Xi))) 


n + 1 


i=l 


< ^r»„{-E (x>y) 0 w (j/-Q!S e -*/ e *(xi))} + 


'll/, 


* 112 


+ 


-E 


n+1 \Uem K n+1 V n +1 '3 

2=1 


n+1 

Ed 


< inf 7 Z(f,a 0 )+£-\ -1|/* 

“ fem K ,a 0 eR w ' n + l" £ 


+ n + 1 


£ 


D. 


n+1 


n+1 

Es 

2=1 


3 • 


(60) 


Because A/n —> 0, there exists N e such that when n > N e , y+11 f*\\n K < £■ In what follows, 
we show that there exists N’ e such that when n > N' £ , Eo n+1 ^” = / 1 ' < e. Thus, when 

n > max(7V e , iV £ ) we have 


^ n {^(x,y)+(z/ - d 0( n) - f (n)( x ))} < . inf K{f,a 0 ) + 3s. 

Since it holds for any £ > 0, Theorem [l] will be proved. 

Now we only need to show that ^^E Dn+1 Y^i=i —> 0 as n —* oo. In fact we can 

show ^pyl+ n+1 < ~^D (li 21 + l) —» 0 as n —» oo. In the following analysis, C 

represents any constant that does not depend on n, but the value of C varies in different 
expressions. Let R = gi 11 ^ 1 / 1 +M(q l + l)^/f\\Y n+1 \\ 2 + \y i \, then as n -+ cx), 4M 2 < 
and we have the upper bound 

c|‘ i <(ca)(+) 2 = +, 
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and since n > y/X asymptotically, we have 


cf < cl cvcf 1 + -) + C\/cf < + C— < C-. L 

' V n) v n y/\ 


Then 


Vi 


V 2 

r n 


cf < CViCf + ccf 2 < 

v A A vA 


(61) 


We can bound V; as follows: 


Vi — ?i-—-yy—f M(qi + 1)4 /^-||W+i ||2 + \Ui 
n + 1 V A 

< gi !!t+h + M( 9l + l) ^\\Y n+1 \\ 2 +\ yi 


yjn + 1 


A 


< C 


IliWilll 


A 


+ C\yi 


Then we have 


E Dn+1 V 2 < 2C 2 E 


D n -\. 1 


\\Y n+ l\\i , 2 

A Vi . 


(62) 


Combining it with (61) and using the assumption Eyf < D , we have 


71+1 


n + 1 




2=1 


C 1 (1 + n 


< 


y/X 1 + n \ A 
c E\\Y n+1 \\l (1 + n 


E\\Y n+l \\l + E\\Y n+l \\\ 


y/~X 1+n \ A 




So when A/n 2 / 3 -> 00 we have ^E Dn+1 Cf -)• 0. 
This completes the proof of Theorem [l] 


□ 
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5.3 Proof of Lemma |T| 

Proof. We observe that the difference of the first derivatives for the function satisfies 


WM ~ 0LO)I = 


2(1 — co)\a — b\ if (a<0,5<0), 
2u\a — b\ if (a>0,6>0), 

2 | (1 — a;) a — oob\ if (a<0,5>0), 
2\coa — (1 — u>)b\ if (a>0,fe<0). 


Therefore we have 


\4>'M - <fM\ < L\a - b\ Vo, 6, 


(63) 


where L = 2 max(l— uj, uf). By the Lipschitz continuity of (f w and Cauchy-Schwarz inequality, 


(a - b) < L\a - b\ 2 Va.b e 


(64) 


If we let <p u (a) = ( L/2)a 2 — then (64) implies the monotonicity of the gradient 

(pf u {a) = La — 0^(a). Therefore <p is a convex function and by the first order condition for 
convexity of 

<Pu(a) > <Pu(b) + p' u {b)(a - b) Va, &6l, 


which is equivalent to (18). 


□ 


5.4 Proof of Lemma [2] 

Proof. 1. By the definition of the majorization function and the fact that is the 


minimizer in (16) 


F Ut A (a (fe+1) ) < Q(a (fc+1) | « (fc) ) < Q(« (fc) \ a (k) ) = F^ x {ct^). 


2. Based on (20) and the fact that Q is continuous, bounded below and strictly convex, 
we have 

0 = VQ(« (fc+1) | o (fc) ) = + 2K u (« (fc+1) - « (fc) ). (65) 
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Hence 


F w> A (a (fc+1) ) < Q(a (fc+1) | 

= F WiA (a {fe) ) + VF u , A (a W )(a li;+1) - a (fc) ) + (a (fc+1) - a {fc) ) T K u (a (fe+1) - ot {k) ) 
= F UiX (a w ) - (a (fc+1) - a (fc) ) T K u (a (fc+1) - a (fc) ). 


By (21) and the assumption that Y^i=i K«Kj is positive definite, we see that K u is also 
positive definite. Let 7 min (K u ) be the smallest eigenvalue of K u then 


0 < 7min(K u )||a( fc+1 )-aW|| 2 < < F Ut x(a^)-F u ,x{a^). 

( 66 ) 

Since F is bounded below and monotonically decreasing as shown in Proof 1, F Ui x(a^) — 
F u ,x(a< k+1 )) converges to zero as k —> oo, from (66) we see that lim^oo ||a (fc+1 ' > — a^|| = 0. 

3. Now we show that the sequence ( a converges to the unique global minimum of 
(12). As shown in Proof 1, the sequence is monotonically decreasing, hence 

is bounded above. The fact that (F WjA (a®)) is bounded implies that (a^) must also 
be bounded, that is because lim^-joc F Ut x(oi) = oo. We next show that the limit of any 
convergent subsequence of (a^) is a stationary point of F. Let (cd^) be the subsequence 
of ( a.W) and let linij_j.no ot^ = a, then by (65) 


o = vg(a (fci+1) I a (fcl) ) = VF w , A (a ft) ) + 2K u (a (fei+1) - a (fci) ). 


Taking limits on both sides, we prove that 5 is a stationary point of F. 


0 = lim VQ(c* (fci+1) | a^ ki) ) = VQ(lim a (fei+1) | lim a {ki) ) 

z—> oo i—¥ oo z—>■ oo 

= VP^ A (fi) + 2K u (a - a) = VF Ui x(a). 


Then by the strict convexity of F. we have that a is the unique global minimum of (12). lg§ 
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5.5 Proof of Theorem [2] 


Proof. 1. By (14) and (16), 


< Q(« (fc+1) | a^) < Q(A k otW + (l - A k )& | a«). 


(67) 


Using (24) we can show that 


Q( A fc a« + (l-A fc )a|aW) 

=i^ A (c^) + (1 - A*)VF WiA (a< fc >)(a - + (1 - A,) 2 (a - a«) T K u (a - a«) 

=A*F w , A (aW) + (1 - A fc ) [Q(a | a^) - A fc (a - a w ) T K„(h - a«)] 

=A fc F w , A (aW) + (1 - A k )F u>x (a). (68) 


Then the statement can be proved by substituting (68) into (67). 
2. We obtain a lower bound for F UtX (a.) 


F W) A(a) > F w>A (a< fc) ) + VF^(a w )(a - a (fc) ) + (a - a (fc) ) T K ; (a - a^), (69) 


and majorization Q(S | a®) 


Q(a | a (fc) ) = F w , A (a ( *>) + VF W]A (a (fc ))(a - a w ) + (a - c^) T K„(S - a< fc) ). (70) 


Subtract (69) from (70) and divide by (S — a^) T K u (a — a®), we have 


Qja. | o: (fc) ) - F UiX (a) 

(a — cx( k ))TK u (a — ah fc )) 

(a — ct( k )) T K/(a — aW) 
~ (a - aW) T K tt (a - a®) 

— 1 0 / min(K~ 1 Kj). 


(71) 


Both K v and Ki are positive dehnite by the assumption that Y^i =i K,K| is positive definite, 


and since 

K“ 1 K / = Ku^Ku^K z Ku^K|, 


37 











the matrix K x Kj is similar to the matrix K„ 2 K/K U 2 , which is positive dehnite. Hence 


r = 1 - 7 min(K- 1 K,) = 1 - 7min (K u 2 K,K U 2 ) < 1. 


By (14) and ( |7lj ) we showed that 0 < A k < T <1. 

3. Since VF Wi >(ci) = 0, using the Taylor expansion on at a, we have 

F u , x (a^) - F UtX (a) > (a™ - aYK,^ - a) > 7m in(K ,)\\a^ - a|| 2 , 

F^a.W) - F UtX (a) < (q« - a) T K M (a« - a) < 7 max (K u )||«( fc ) - a|| 2 . 
Therefore, by Results 1 and 2 

(*+!)_ ~„ 2 / ^,A(a (fe+1) ) - F UiX (S) ^ r(F^( a (V) - F UiX (S)) ^ 


ct x 


-a h < 


'Tmin(K/ 


< 


7 min(K( 


< r- 


'Tmin(K/ 


a^ — all 2 . 


□ 
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